Skip to content

zdb: better handling for corrupt block pointers#17166

Merged
behlendorf merged 5 commits intoopenzfs:masterfrom
asomers:zdb-corrupt-bp
Aug 12, 2025
Merged

zdb: better handling for corrupt block pointers#17166
behlendorf merged 5 commits intoopenzfs:masterfrom
asomers:zdb-corrupt-bp

Conversation

@asomers
Copy link
Contributor

@asomers asomers commented Mar 21, 2025

When dumping indirect blocks, attempt to print corrupt block pointers rather than abort the program.

Sponsored by: ConnectWise
Signed-off-by: Alan Somers [email protected]

Motivation and Context

When trying to print a file's indirect blocks with "zfs -vvbbbb", if that file contains corrupt block pointers zdb will fail assertions and crash.

Description

Remove three assertions, replacing them with warnings printed to stderr. The rest of zdb does not require those block pointers to be intact.

How Has This Been Tested?

Tested with a dataset that had become corrupted, similarly to #17077 .

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that assertions are not good for the data we read from disks and that might get corrupted, but I am not sure just complaining to stderr is a good solution either. Error messages printed to stderr might be difficult to correlate to the output printed to stdout. Also since we use zdb as a pool verification tool in some CI tests, I am not sure spamming stderr without returning actual error status would be noticed.

@asomers
Copy link
Contributor Author

asomers commented Mar 24, 2025

I agree that assertions are not good for the data we read from disks and that might get corrupted, but I am not sure just complaining to stderr is a good solution either. Error messages printed to stderr might be difficult to correlate to the output printed to stdout. Also since we use zdb as a pool verification tool in some CI tests, I am not sure spamming stderr without returning actual error status would be noticed.

So what would you prefer? I could print the messages to stdout instead, and eventually exit with status 1.

@amotin
Copy link
Member

amotin commented Mar 24, 2025

Exit code would definitely be good. How to better print it I am not sure though. I haven't looked close on that part of the code. But I already saw zdb using stderr for non-error messages, making it useless as error indicator. And as I have noted their correlation with stdout was a pain. So I'd say if we shoudl print the pointer, lets integrate printing of invalid values there also.

@asomers
Copy link
Contributor Author

asomers commented Apr 7, 2025

The latest commit prints all of the errors to stdout instead of stderr. It wasn't possible to print the L1 block's fill error on the same line as the block pointer itself, at least not without buffering potentially massive amounts of text in RAM, so it's printed afterwards. The output looks like this:

       a90000000   L1  DVA[0]=<5:4088dc7a0000:15000> DVA[1]=<2:3a8fade2f000:15000> [L1 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique double size=20000L/c000P birth=7970383L/7970383P fill=16305515943967419494 cksum=000010c4104a06a3:01b38f9c940ef690:543c183d6a120c26:3f99c20b19c3eb3e
...
       a93940000    L0 DVA[0]=<14483840:7928e65e665e1a00:e200000> DVA[1]=<0:148154a00000000:1c39e00> DVA[2]=<8394880:0:1000000> [L0 unallocated] inherit inherit unencrypted BE gang unique triple size=200L/200P birth=1158310211095814315L/4409647458288664576P fill=6480302737822677281 cksum=0a4c2450925129cd:0124e955ddeccd13:0d3c3894ff64c1dd:c5d48c69c7b34245(ERROR: Block pointer type (0) does not match dnode type (19))
...
       a93a60000    L0 HOLE [L0 unallocated] size=200L birth=12694941463093612928L(ERROR: Block pointer type (0) does not match dnode type (19))
...
       a90000000: ERROR: Block pointer fill (16305515943967419494) does not match calculated value (4160680085501909892)
       a94000000   L1  DVA[0]=<0:3de2940b4000:15000> DVA[1]=<4:4174597d5000:15000> [L1 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique double size=20000L/c000P birth=7974332L/7974332P fill=897 cksum=0000108206fb0d15:019223eac4ced4e5:0d3002f5ea989522:7045956156f0242b

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is quite verbose. I wonder what happen with some code that may try to parse the output (I think we have some in ZTS), and without error status returned still it might be difficult to handle properly.

@asomers
Copy link
Contributor Author

asomers commented Apr 9, 2025

It is quite verbose. I wonder what happen with some code that may try to parse the output (I think we have some in ZTS), and without error status returned still it might be difficult to handle properly.

I could change the exit status to 1 if any corruption is found. However, there is precedent for not doing that. zdb will already report corruption and yet exit 0 in the following conditions:

  • unmatched FREE(s)
  • Livelist ALLOC ... from TXG ... FREED at TXG ...
  • DOUBLE ALLOC
  • DOUBLE FREE
  • Found block that crosses metaslab boundary
  • Found livelist blocks marked as allocated for indirect vdevs
  • while trying to open subobj id
  • path not found, possibly leaked
  • potentially leaked objects detected

So for consistency's sake, I think I should leave the exit status unchanged.

@asomers
Copy link
Contributor Author

asomers commented Apr 22, 2025

As discussed in the OpenZFS leadership meeting today, we agreed to change zdb to always exit non-zero if it detects any corruption. I'll create a distinct error code for that.

@asomers
Copy link
Contributor Author

asomers commented Apr 23, 2025

@amotin @allanjude with the latest commits, zdb will exit 3 if on-disk corruption was detected. It will also follow the usual convention of exiting 2 due to invalid CLI usage.

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just randomly looking around the handled cases I found couple more.

@asomers asomers requested a review from amotin May 13, 2025 17:34
@asomers
Copy link
Contributor Author

asomers commented Aug 4, 2025

ping @amotin could you please review again?

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. So be it. But please rebase it to the latest master so we have clean(er) CI tests.

asomers added 5 commits August 6, 2025 14:42
When dumping indirect blocks, attempt to print corrupt block pointers
rather than abort the program.

Sponsored by:	ConnectWise
Signed-off-by:	Alan Somers <[email protected]>
@behlendorf
Copy link
Contributor

@asomers I manually rebased this PR to get an updated CI run.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Aug 6, 2025
@behlendorf behlendorf self-requested a review August 6, 2025 21:44
@behlendorf behlendorf merged commit d3c1d27 into openzfs:master Aug 12, 2025
25 checks passed
@asomers asomers deleted the zdb-corrupt-bp branch August 13, 2025 16:51
spauka pushed a commit to spauka/zfs that referenced this pull request Aug 30, 2025
When dumping indirect blocks, attempt to print corrupt block pointers
rather than abort the program.  When corruption is detected zdb will
exit with an error code of 3.

Sponsored by:	ConnectWise
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Alek Pinchuk <[email protected]>
Signed-off-by:	Alan Somers <[email protected]>
Closes openzfs#17166
lundman pushed a commit to openzfsonosx/openzfs-fork that referenced this pull request Jan 30, 2026
When dumping indirect blocks, attempt to print corrupt block pointers
rather than abort the program.  When corruption is detected zdb will
exit with an error code of 3.

Sponsored by:	ConnectWise
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Alek Pinchuk <[email protected]>
Signed-off-by:	Alan Somers <[email protected]>
Closes openzfs#17166
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Status: Code Review Needed Ready for review and testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants